Explainable artificial intelligence approaches for COVID-19 prognosis prediction using clinical markers

The COVID-19 pandemic proved to be fatal, causing millions of deaths worldwide. Vaccines were eventually developed, effectively preventing the severe symptoms caused by the disease. However, some of the population (the elderly and patients with comorbidities) are still vulnerable to severe symptoms such as breathlessness and chest pain. Identifying these patients in advance is imperative to prevent a bad prognosis. Hence, machine learning and deep learning algorithms have been used for early COVID-19 severity prediction using clinical and laboratory markers. The COVID-19 data was collected from two Manipal hospitals after obtaining ethical clearance. Multiple nature-inspired feature selection algorithms were used to choose the most crucial markers. A maximum testing accuracy of 95% was achieved by the classifiers. The predictions obtained by the classifiers have been demystified using five explainable artificial intelligence (XAI) techniques. According to XAI, the most important markers are C-reactive protein, basophils, lymphocytes, albumin, D-Dimer and neutrophils. The models could be deployed in various healthcare facilities to predict COVID-19 severity in advance so that appropriate treatments could be provided to mitigate a severe prognosis. The computer-aided diagnostic method can also aid healthcare professionals and ease the burden on already strained healthcare infrastructure.

• Descriptive statistical analysis of the data has been conducted to understand various trends and patterns in the data.
• Fourteen feature selection methods, including nature-inspired algorithms, have been used to choose the most important markers.
• Machine learning models including bagging, boosting, voting and stacking have been used to predict COVID-19 severity. The classifiers have been further compared with state-of-the-art deep learning models such as deep neural network (DNN), one-dimensional convolutional neural network (1D-CNN) and long short-term memory (LSTM).
• Five XAI techniques, namely SHAP, LIME, Eli5, QLattice and Anchor, have been used to interpret the predictions.
• Further discussion about crucial COVID-19 prognostic markers from a medical perspective.
The remainder of the paper is structured as follows. Materials and methods are described in the "Methods" section. The results are explained in detail in the "Results" section. The results obtained are discussed in the "Discussion" section. The article concludes in the "Conclusion" section.

Description of the dataset
The COVID-19 datasets were obtained from two hospitals in India: Dr TMA Pai Hospital and Kasturba Medical College. The Manipal Academy of Higher Education has provided ethical clearance to conduct this research (IEC:613/2021). The patients have been completely anonymized in this study. COVID-19 patients who were tested between September 2021 and December 2021 have been considered in this study. Only patients above eighteen years of age have been included. Records of 899 patients have been utilized to train the machine learning models. The dataset included 599 non-severe patients and 300 severe patients. All patients whose condition deteriorated and required admission to the intensive care unit (ICU), and those with a respiratory rate > 30/minute or SpO2 < 90% (World Health Organization standards), were grouped as severe cases [16]. Thirty-two clinical parameters were considered in this study (31 continuous and one categorical). The clinical markers chosen are tabulated in Table 1.
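The severity grouping described above can be written as a small rule. The sketch below is illustrative only: it assumes that meeting any one of the three criteria is sufficient for the severe label, and the `Vitals` record and its field names are hypothetical rather than taken from the study's data dictionary.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical per-patient record; field names are illustrative."""
    icu_admission: bool      # patient deteriorated and needed the ICU
    respiratory_rate: float  # breaths per minute
    spo2: float              # oxygen saturation in percent


def is_severe(v: Vitals) -> bool:
    # Assumption: any single criterion is enough to label a case as severe.
    return v.icu_admission or v.respiratory_rate > 30 or v.spo2 < 90


# Three made-up patients: tachypnoea only, low SpO2 only, and neither.
labels = [
    is_severe(Vitals(False, 32.0, 96.0)),
    is_severe(Vitals(False, 20.0, 88.0)),
    is_severe(Vitals(False, 18.0, 97.0)),
]
```

The thresholds are strict inequalities, matching the "> 30/minute" and "< 90%" wording in the text.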

Data pre-processing
Pre-processing of the dataset is critical in machine learning. Missing values are imputed, categorical attributes are encoded, continuous values are scaled, data balancing is performed, and unnecessary attributes are dropped. To keep the number of missing values as low as possible, we chose patients who had completed the most clinical tests when gathering data. The few missing values in the dataset were replaced by the median of the respective attribute. The "gender" attribute (categorical) had no missing values. Descriptive statistical analysis was conducted using the open-source statistical software Jamovi. Some statistical parameters utilized are described in Table 2.
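The median imputation step can be illustrated with a short sketch. It assumes the continuous markers sit in a NumPy array with missing values encoded as NaN; the function name `impute_median` and the toy values are illustrative and do not come from the study.

```python
import numpy as np


def impute_median(X, medians=None):
    """Replace NaN entries with per-column medians.

    If `medians` is None, they are computed from X itself (ignoring NaNs).
    Pass previously computed medians to reuse them on held-out data.
    """
    X = np.asarray(X, dtype=float).copy()
    if medians is None:
        medians = np.nanmedian(X, axis=0)
    medians = np.asarray(medians, dtype=float)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X, medians


# Toy example: two hypothetical markers, one missing value in the second column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [5.0, 6.0]])
filled, med = impute_median(toy)
```

Returning the medians alongside the filled array makes it possible to apply the same values to a later test split instead of recomputing them there.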
Violin plots were used to find interesting patterns in the dataset, as shown in Fig. 1. From the figure, it can be seen that the median age was elevated in the severe COVID-19 cohort. Further, markers such as neutrophils, HbA1c and CRP were elevated in severe patients. The lymphocyte and monocyte counts decreased in the severe COVID-19 cohort.
The frequency of the "gender" attribute for severe/non-severe COVID-19 patients is described using a bar plot in Fig. 2. There were 347 male and 252 female patients in the non-severe cohort. There were 204 male patients and 96 female patients in the severe cohort.
In machine learning analysis, categorical values must be encoded since most classifiers do not handle text values. Several encoding techniques exist in machine learning [17]. In this study, we used the one-hot encoding technique to encode the "gender" attribute [18]. This encoding avoids imposing an artificial order on the categories, a problem that can arise when categorical variables are coded as integers. Data scaling was performed using the standardization method [19]. When attributes differ considerably in scale, accuracy decreases, because many classifiers favour parameters with larger values regardless of the units considered. Normalization and standardization are the two approaches commonly used to scale datasets in machine learning. Standardization was chosen in this study since it is less sensitive to outliers. The dataset was then split into training and testing sets in the ratio 80:20. There was a significant imbalance in the dataset: the number of severe COVID-19 cases was almost half the number of non-severe cases. Results obtained on unbalanced data are biased since the models favour the majority class. Hence, we used an oversampling technique called Borderline Synthetic Minority Oversampling Technique (Borderline-SMOTE) to balance the training dataset [20]. This algorithm generates new synthetic samples using the K-nearest neighbors algorithm and handles borderline cases well. Under-sampling was not preferred in this study since we did not want to lose interesting trends and patterns. Further, the testing data was not balanced to protect data integrity.
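Two of the steps above, standardization and synthetic oversampling, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the pipeline used in the study: `smote_like_samples` implements the plain SMOTE interpolation idea (a new point on the segment between a minority sample and one of its nearest minority neighbours) and omits the borderline selection step of Borderline-SMOTE. Both function names are hypothetical.

```python
import numpy as np


def standardize(train, test):
    """Z-score both sets using the mean and standard deviation of the training set only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Requires at least two minority samples. Plain SMOTE idea only; the
    borderline filtering of Borderline-SMOTE is not implemented here.
    """
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(minority) - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        partner = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[partner] - minority[base])
    return synthetic
```

Fitting the scaler on the training split alone, and oversampling only that split, keeps the test data untouched, which matches the statement that the testing data was not balanced.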
Fourteen feature selection methods, including several metaheuristic nature-inspired algorithms, were used to choose the most important markers. Feature selection is essential in machine learning since classifiers generally perform better once redundant features are removed. Nature-inspired algorithms have several advantages over traditional feature selection techniques: they are known for their global optimization, robustness, scalability, parallelism, adaptability, simplicity and stochasticity. Table 3 describes the features chosen by each algorithm. Among all the algorithms, the salp swarm optimization chose the maximum number of features (18). The whale optimization algorithm, flower pollination algorithm and mutual information chose 15 features each. The sine cosine algorithm chose the minimum number of features (3). The Harris hawks optimization and particle swarm optimization chose six features each. The markers chosen by the feature selection techniques are also described in Fig. 3. CRP was the most chosen feature, since thirteen algorithms included it. This was followed by neutrophils, NLR and AST, which were chosen 10, 9 and 8 times, respectively. The platelets marker was not chosen by any algorithm.
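Mutual information, one of the fourteen selection methods, scores a marker by how much knowing its value reduces uncertainty about the class. The sketch below is a simplified, discretized version (each marker is split at its median) and is not necessarily the estimator used in the study; the marker values are made up for illustration.

```python
import math
from collections import Counter
from statistics import median


def binarize_at_median(values):
    """Discretize a continuous marker into 0/1 around its median."""
    m = median(values)
    return [int(v > m) for v in values]


def mutual_information(feature_bins, labels):
    """Mutual information (in nats) between a discrete feature and the class label."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    p_x = Counter(feature_bins)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi


# Hypothetical toy values: one marker separates the classes, the other barely does.
severe = [0, 0, 0, 1, 1, 1]
crp = [5, 8, 12, 90, 110, 150]
platelets = [200, 310, 250, 240, 300, 260]

mi_crp = mutual_information(binarize_at_median(crp), severe)
mi_platelets = mutual_information(binarize_at_median(platelets), severe)
```

Ranking markers by this score and keeping the top ones is the basic filter-style selection; the nature-inspired algorithms instead search over subsets of markers.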

Machine learning concepts
Machine learning is a form of artificial intelligence that enables software programs to predict outcomes using past information as input. Several ML classifiers have been used in this study, such as random forest, decision tree, logistic regression, K-nearest neighbors, CatBoost, AdaBoost, XGBoost, LightGBM, stacking and voting algorithms. Stacking combines the results of multiple baseline models [35]. The stacking architecture consists of a classifier (the meta-learner) that takes the baseline models' predictions as its input. The meta-learner learns how to weight the baseline models, which can improve accuracy, so its choice is a crucial factor in stacking. Logistic regression was the meta-learner used in this research. The stacking architecture is described in Fig. 4.
A voting classifier gathers the predictions of an ensemble of classifiers and combines them into a single prediction. It uses the concept of majority voting [36]. The voting algorithm is of two types: hard voting and soft voting. In hard voting, the class with the maximum number of votes is chosen, irrespective of the weights [37]. In soft voting, the outcome is the class with the highest average predicted probability [38]. The voting architecture is described in Fig. 5.
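The difference between the two voting schemes is easiest to see on a small example. The following sketch assumes each base model outputs a probability for the severe class; the function names and the probabilities are illustrative.

```python
import numpy as np


def hard_vote(probas, threshold=0.5):
    """Majority vote over the models' thresholded class predictions.

    probas has shape (n_models, n_samples) and holds P(severe) per model.
    """
    votes = (np.asarray(probas) >= threshold).astype(int)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)


def soft_vote(probas, threshold=0.5):
    """Threshold the average predicted probability across models."""
    return (np.asarray(probas).mean(axis=0) >= threshold).astype(int)


# Three hypothetical models, two patients.
probas = np.array([[0.9, 0.6],
                   [0.4, 0.6],
                   [0.4, 0.1]])
hard = hard_vote(probas)
soft = soft_vote(probas)
```

For the first patient, one very confident model outweighs two mildly negative ones under soft voting but loses the head count under hard voting; for the second patient the opposite happens. This is why the two schemes can report different metrics on the same base models.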
Further, the data was subjected to a fivefold cross-validation technique. Here, the model is trained and validated on different subsets of the data to assess its efficiency. The input data is divided into five equal groups. In each round, four groups are used for training, while the fifth group is used for testing; the rounds are repeated so that every group serves as the test group once. Hyperparameter tuning was performed to choose the best parameters using the grid search method. The performance of a classifier depends upon the hyperparameters chosen. Grid search automates the hyperparameter tuning and provides the best values as output.
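The fold rotation and the exhaustive search over a parameter grid can be sketched as follows. This is a minimal illustration, assuming a user-supplied scoring function; `k_fold_indices`, `grid_search` and the example grid are hypothetical names and values, not the settings used in the study.

```python
from itertools import product

import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]


def grid_search(score_fn, grid):
    """Evaluate every combination in `grid` and return the best one.

    `score_fn(params)` should return a number where higher is better,
    for example the mean validation accuracy over the folds above.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


splits = list(k_fold_indices(20, k=5))
```

In practice `score_fn` would train a classifier with the given parameters on each training fold and average its score on the corresponding test folds.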
We have chosen several classification and loss metrics to evaluate the models in this study. These include precision, recall, accuracy, F1-score, area under curve (AUC), average precision (AP), Matthews correlation coefficient, log loss, Jaccard score and Hamming loss. Emphasis has been given to precision and recall since they focus on false-positive and false-negative cases, respectively.
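Most of the listed metrics follow directly from the four cells of the confusion matrix. The sketch below computes them for binary labels with 1 denoting the severe class; the function name and the toy predictions are illustrative.

```python
import math


def binary_metrics(y_true, y_pred):
    """Confusion-matrix based metrics for binary labels (1 = severe)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,  # penalizes false positives
        "recall": recall,        # penalizes false negatives
        "f1": f1,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        "hamming_loss": (fp + fn) / n,
    }


# Toy example: 3 severe and 5 non-severe patients, one error of each kind.
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

AUC, average precision and log loss are not included because they need predicted probabilities rather than hard labels.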
In this research, three state-of-the-art deep learning models have been tested. They are DNN, 1D-CNN and LSTM. A DNN consists of an input layer, multiple hidden layers and an output layer [39]. The essential function of a deep neural network is to take inputs, process them through successive layers of computation, and predict results. CNNs are primarily used for image classification. However, 1D-CNN models have also been highly influential in classifying tabular data [40]. LSTMs are widely used in sequence prediction problems [41]. Three types of gates are considered in LSTM: input gate, output gate, and forget gate. LSTMs have proven to be highly efficient in handling time series data.
After training and testing the ML and DL models, five XAI techniques have been used to demystify the predictions. The results obtained by the XAI techniques are in the form of graphs and tables, which can be easily understood by the ML users. The entire process flow of this study is described in Fig. 6.

Ethical approval
Ethical clearance has been obtained to collect patient data from the Manipal Academy of Higher Education ethics committee with id IEC: 613/2021. The need for informed consent was waived by the ethics committee/Institutional Review Board of Manipal Academy of Higher Education, because of the retrospective nature of the study. All methods were carried out in accordance with relevant guidelines and regulations.

Model testing
In this research, multiple machine learning and deep learning classifiers have been trained and tested to predict COVID-19 severity. The precision obtained by the models for various feature selection techniques is tabulated in Table 4. We emphasized the stacking and voting classifiers since they combine multiple models. From the table, it can be seen that the stacked model obtained the maximum precision of 94% after using mutual information. The soft-voting and hard-voting classifiers also obtained a precision of 94% each. The bat algorithm performed well too: the stack, hard-voting and soft-voting classifiers obtained a precision of 91%, 91% and 90%, respectively. The flower pollination algorithm was also efficient, with the stack, hard-voting and soft-voting classifiers obtaining a precision of 87%, 86% and 84%, respectively. The precision obtained for the stack, hard-voting and soft-voting classifiers after using the Jaya algorithm was 87%, 90% and 89%, respectively.
The recall obtained by the models for all the feature selection techniques is described in Table 5. Mutual information was the best feature selection method: the recall obtained by the stack, hard-voting and soft-voting algorithms was 93%, 95% and 94%, respectively. The bat algorithm was the next best-performing method, with the stack, hard-voting and soft-voting models obtaining a recall of 90%, 93% and 91%, respectively. The flower pollination algorithm performed well too, with the stack, hard-voting and soft-voting models obtaining a recall of 86%, 90% and 90%, respectively. The recall obtained by the stack, hard-voting and soft-voting classifiers after using the Jaya algorithm was 87%, 91% and 90%, respectively.
For further analysis, the best four feature selection methods (mutual information, the bat algorithm, the flower pollination algorithm and the Jaya algorithm) were considered. The classification and the loss metrics are tabulated in Table 6. Mutual information performed the best among the four methods. The accuracy obtained for the stack, hard-voting and soft-voting classifiers was 90%, 95% and 94%, respectively. The bat algorithm was able to obtain excellent results too. The accuracies obtained by the stacking, hard-voting and soft-voting classifiers were 92%, 95% and 91%, respectively. The flower pollination algorithm performed relatively well. The accuracies obtained by the stacking, hard-voting and soft-voting classifiers were 87%, 85% and 86%, respectively. The accuracies obtained by the stack, hard-voting and soft-voting classifiers for the Jaya algorithm were 89%, 89% and 89%, respectively.
The ROC curves for the stacked model for the four feature selection methods are depicted in Fig. 7. The AUC was maximum for the mutual information algorithm, at 0.96. The precision-recall curves for the stacked classifiers for the four feature selection methods are described in Fig. 8. The stacked model obtained a maximum average precision of 0.98 after being trained on features chosen by mutual information.
Further, the results obtained by the machine learning models were compared with the deep learning models. DNN, 1D-CNN and LSTM were the classifiers used in this study. The model architecture of the deep neural networks is described in Fig. 9. For the DNN, five layers were considered. The number of neurons used was 30,
The LSTM used four layers consisting of 150, 75, 50 and 1 neurons, respectively. The loss function used was "binary cross-entropy", and the optimizer was "Adam". The batch size was set to 32.
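For reference, the binary cross-entropy loss named above averages the negative log-likelihood of the true class over the samples. The following plain-Python sketch shows the computation; the clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero, and the function name is illustrative.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted P(severe)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# An uninformative model (always 0.5) versus a confident, correct one.
loss_uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
loss_confident = binary_cross_entropy([1, 0], [0.99, 0.01])
```

A model that always predicts 0.5 incurs a loss of ln 2 per sample, so values well below that indicate the network has learned something beyond chance.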
For all three models, the data was split into training and testing sets in the ratio of 80:20. The results obtained by the deep learning models are described in Table 7. Among the three, DNN performed the best, with an accuracy of 89%. 1D-CNN and LSTM obtained accuracies of 85% and 83%, respectively. The accuracy and loss curves for the models are depicted in Fig. 10. From the figure, the results obtained by the models appear reliable, with no sign of overfitting.

Explainable artificial intelligence
In this study, five XAI methods (SHAP, LIME, QLattice, Eli5 and Anchor) have been used to make the models more interpretable. We chose the stacked model for interpretation since it obtained good results and stacked models are generally reliable. Deep learning classifiers were not considered since many explainers do not support deep learning algorithms today. Further, machine learning algorithms performed better than deep learning models in this study. This is common in artificial intelligence applications, since deep learning models generally need large datasets to perform well. SHAP is a widely used XAI technique that makes global and local interpretations [42]. SHAP draws on cooperative game theory and probability theory to quantify the impact of each attribute. The global interpretation of the models is explained using beeswarm plots as described in Fig. 11. The vertical line at zero separates contributions towards the non-severe class (left) from those towards the severe class (right). Red indicates a higher value, and blue indicates a lower value. The markers are also arranged based on their importance (the best feature is at the top). The figure shows that the most important markers are basophils, CRP, LDH, lymphocytes, albumin, protein and ferritin. CRP, LDH and ferritin levels increased in severe COVID-19 patients. Basophils, lymphocytes, albumin and protein levels decreased in severe COVID-19 patients.
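The game-theoretic idea behind SHAP is the Shapley value: the contribution of a marker is its average marginal effect over all subsets of the other markers. The sketch below computes exact Shapley values by enumeration for a tiny hypothetical linear model with three features. It illustrates the definition only; it is not how the SHAP library computes its estimates for large models, and the weights, instance and baseline are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features.

    `value_fn(coalition)` returns the model output when only the features
    in `coalition` (a frozenset of indices) take the instance's values.
    """
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical linear model, instance and baseline (e.g. cohort averages).
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 2.0]
baseline = [0.0, 1.0, 0.0]


def value_fn(coalition):
    # Features outside the coalition are replaced by their baseline value.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return sum(w * v for w, v in zip(weights, mixed))


phi = shapley_values(value_fn, 3)
```

For a linear model each value reduces to the weight times the feature's deviation from the baseline, and the values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that force plots visualize.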
Local interpretations can be explained using the SHAP force plot, as shown in Fig. 12. Figure 12a,c indicate a non-severe prognosis. It can be seen that markers such as lymphocytes, SpO2, basophils and CRP are pushing the predictions towards a non-severe prognosis. Figure 12b,d indicate a severe COVID-19 prognosis. Markers such as CRP, AST, basophils and lymphocytes push the predictions towards severe COVID-19.
LIME is another explainer used to make local interpretations [43]. It uses a model-agnostic approach (it works for most ML models). It uses a ridge regression model and kernels such as Gaussian and RBF to explain the predictions. The LIME interpretations are depicted pictorially in Fig. 13. Figure 13a,b predict a severe prognosis, and Fig. 13c,d indicate a non-severe prognosis. The attributes are also arranged in descending order of their importance. The figure shows that the most important markers are albumin, D-Dimer, LDH, CRP, basophils, protein, AST, SpO2 and lymphocytes.
Eli5 is yet another method to demystify predictions [44]. It is a Python package and is widely used with tree-based classifiers. Figure 14 depicts Eli5 predictions, and according to it, the most essential attributes are albumin, urea, lymphocytes, CRP, NLR and basophil count. This explainer also reports a "bias" term (the model's intercept or baseline contribution).
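The LIME procedure mentioned above (perturb the instance, weight the perturbed points with a kernel, fit a ridge surrogate) can be sketched in NumPy. This is a simplified illustration and not the LIME package itself: the sampling scheme, kernel width and function name `lime_explain` are assumptions, and the "black box" is a hypothetical linear function chosen so that the recovered coefficients can be checked.

```python
import numpy as np


def lime_explain(predict_fn, x, n_samples=500, scale=1.0,
                 kernel_width=0.75, alpha=1.0, seed=0):
    """Fit a locally weighted ridge surrogate around instance x.

    Returns (coefficients, intercept); the coefficients are the local
    feature attributions and the intercept approximates predict_fn(x).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturbed samples
    y = predict_fn(Z)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)  # Gaussian (RBF) proximity kernel
    A = np.hstack([np.ones((n_samples, 1)), Z - x])  # intercept + offsets from x
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    AtW = A.T * w
    beta = np.linalg.solve(AtW @ A + penalty, AtW @ y)
    return beta[1:], beta[0]


# Hypothetical black box: a linear score over two markers.
w_true = np.array([2.0, -1.0])
coef, intercept = lime_explain(lambda Z: Z @ w_true + 3.0, [1.0, 2.0], alpha=1e-3)
```

Because the black box here is itself linear, the surrogate recovers its weights almost exactly; for a non-linear classifier the coefficients describe its behaviour only in the neighbourhood of the explained patient.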
Abzu developed the QLattice explainer [45]. It uses quantum computing and symbolic regression to explain the predictions. QLattice trains the models to understand the variation in data. The input attributes are called registers. A collection of registers is termed a QGraph. Every QGraph has a set of nodes (registers) and activation functions. Activation functions such as add, multiply, log, sine, tanh and Gaussian are generally used. The QGraphs are described in Fig. 15. It can be seen that the most important markers are lymphocytes, CRP and D-Dimer.
Anchor is an XAI technique that uses rules and conditions [46]. The strength of an anchor is measured using its precision and coverage. Precision defines the accuracy of the anchor. Coverage determines how many instances satisfy the same conditions. The anchors for non-severe and severe cases are described in Table 8. The most important markers are basophils, albumin, lymphocytes, CRP, D-Dimer, neutrophils, protein and NLR.
Five XAI techniques have been utilized and their findings are similar. The most important markers that can predict a patient's severity are CRP, lymphocytes, basophils, albumin, D-Dimer, NLR and neutrophils.
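Precision and coverage of an anchor can be illustrated with an empirical count over a set of instances. The sketch below is a simplification: the Anchor method estimates these quantities over perturbed samples around the explained instance, whereas here they are counted on a fixed toy dataset. The rule, thresholds and patient values are hypothetical.

```python
def anchor_stats(rule, instances, predictions, target):
    """Empirical precision and coverage of a rule for a target prediction.

    coverage  = fraction of instances that satisfy the rule
    precision = fraction of covered instances predicted as `target`
    """
    covered = [i for i, inst in enumerate(instances) if rule(inst)]
    coverage = len(covered) / len(instances)
    if not covered:
        return 0.0, coverage
    precision = sum(predictions[i] == target for i in covered) / len(covered)
    return precision, coverage


# Hypothetical patients and model predictions (1 = severe).
patients = [
    {"crp": 80, "lymphocytes": 10},
    {"crp": 60, "lymphocytes": 12},
    {"crp": 55, "lymphocytes": 14},
    {"crp": 10, "lymphocytes": 30},
    {"crp": 70, "lymphocytes": 25},
]
predicted = [1, 1, 0, 0, 1]


def rule(p):
    # Illustrative anchor: high CRP together with low lymphocytes.
    return p["crp"] > 50 and p["lymphocytes"] < 15


precision, coverage = anchor_stats(rule, patients, predicted, target=1)
```

A useful anchor has high precision, so the rule reliably implies the prediction, and reasonable coverage, so it applies to more than a handful of patients.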

Discussion
This research used multiple machine learning algorithms to predict severe COVID-19 cases in advance so that appropriate treatments could be provided for vulnerable patients. To demystify the predictions, five heterogeneous XAI techniques were used. Doctors and medical professionals can easily understand the variation in the markers provided by the explainers. This decision support system can be set up in various medical facilities to aid healthcare workers. In developing countries, this application can be used to make judicious use of essential medical assets such as ICU beds, ventilators and medicines. The models can also be utilized to present a second opinion to the doctors.
Fourteen feature selection methods were utilized and we chose the best four for further analysis. They are mutual information, bat algorithm, flower pollination algorithm and Jaya algorithm. A maximum accuracy of 95% was obtained by the mutual information algorithm. The F1-score, AUC and AP were 94%, 0.98 and 0.99. When the bat algorithm was utilized, a 93% accuracy was obtained. The F1-score, AUC and AP were 92%, 0.97 and 0.94. When the flower pollination algorithm was used, an accuracy of 89% was obtained. The F1-score, AUC and AP were 88%, 0.95 and 0.97. When the Jaya algorithm was utilized, a 90% accuracy was obtained. The F1-score, AUC and AP were 88%, 0.95 and 0.97. Most machine learning models performed relatively well.
Several markers showed variation between the two cohorts. Among all, CRP was chosen by all the XAI techniques. CRP levels increased in severe COVID-19 patients in this study [47]. Lymphocyte levels decreased in severe [...] in the future to test real patients' prognosis. Our machine learning models could also be used for other diseases and public health issues [59-65].

Conclusions
XAI is a part of machine learning, generally used to demystify the predictions made by the classifiers. In this study, we used several supervised learning algorithms and XAI techniques to predict COVID-19 severity in advance. The patients vulnerable to severe COVID-19 symptoms can be identified early, and appropriate treatments can be provided to save them. Various patterns and trends in the clinical markers were observed using descriptive statistics in the initial part of this research. Multiple feature selection techniques, including nature-inspired algorithms, were utilized to select the most crucial parameters. Several algorithms, such as bagging, boosting, stacking, voting and state-of-the-art deep learning, were used to make accurate predictions. The mutual information algorithm proved to be the most efficient feature selection technique, obtaining a maximum accuracy of 95%. Five heterogeneous XAI algorithms (SHAP, LIME, QLattice, Eli5 and Anchor) have been used to understand the classification predictions. According to them, the most essential marker was CRP. Other markers such as D-Dimer, lymphocytes, neutrophils, albumin and basophils were also crucial. The classifiers can be utilized as a decision support system in various hospitals to predict COVID-19 severity in advance. They can also aid medical professionals and offer them a second opinion. The algorithms can also be used for rapid diagnosis.
In the future, cloud-based models can be deployed. They can store both the data and the code more efficiently. High-end GPUs can be utilized to train deep learning algorithms. Other diagnostic methods, such as rapid antigen tests, chest X-rays and genome sequencing, can be combined suitably. Prognosis can be predicted for various COVID-19 variants. Electronic health records from multiple hospitals across various countries can be combined before training the models. Other deep learning techniques such as fuzzy ensembling techniques could be utilized.

Figure 2. Frequency distribution of the gender attribute.

Figure 3. Markers chosen by the feature selection methods.

Figure 4. Stacking methodology used in this research.

Figure 5. Voting methodology used in this research.

Figure 6. Machine learning methodology used in this research.
Figure 12b,d indicate a severe COVID-19 prognosis. Markers such as CRP, AST, basophils and lymphocytes push the predictions towards severe COVID-19.
Figure 13a,b predict a severe prognosis, and Fig. 13c,d indicate a non-severe prognosis. The attributes are arranged in descending order of their importance. Figure 13 shows that the most important markers are albumin, D-Dimer, LDH, CRP, basophils, protein, AST, SpO2 and lymphocytes.

Figure 10. Accuracy and loss curves obtained by the deep learning classifiers. (a) Accuracy curve for DNN (b) Accuracy curve for 1D-CNN (c) Accuracy curve for LSTM (d) Loss curve for DNN (e) Loss curve for 1D-CNN (f) Loss curve for LSTM.

Table 1. Attributes chosen in this study.

Table 3. Feature selection using several algorithms.

Table 4. Precision obtained by the classifiers for various feature selection methods (in %).

Table 5. Recall obtained by the classifiers for various feature selection methods (in %).

Table 6. Classification and loss metrics for the best four selection methods (in %).

Table 7. Classification and loss metrics obtained by the deep learning models.